A new way of classifying NBA Players

Introduction

If you know me well, you are probably aware that my basketball fandom has been one of my life’s defining attributes. You may also know how I like to keep track of my favorite players’ season-average stats, often crunching numbers in my head after every shot when watching games. But this is the first time I am using an elaborate arsenal of Data Science tools to look at NBA data. Needless to say, I’m very excited about where we can go with this.

About 7 months ago, I stumbled upon an article on TheScore explaining why the traditional 5 positions are no longer enough to describe current NBA players. Ever since the rise of the Golden State Warriors dynasty led by Stephen Curry, many NBA teams have been trying hard to replicate Golden State's success, drafting and trading for guards with promising shooting skills and turning the NBA into a 'Small Ball' league. As a result, players in the traditional forward positions have had to change their play style, shooting more threes and picking up more assists to maintain their relevance in this new era. The game has shifted toward a state of positional hybridism. Therefore, the traditional 5 positions (Point Guard, Shooting Guard, Small Forward, Power Forward, Center) are no longer sufficient to describe today's NBA players. The authors of the article came up with a way to classify players into 9 classes, based on the way they play the game.

In this tutorial, I will take another shot at classifying players into clusters, depending on what they do on the court. However, I will do it using data science, and more precisely K-Means clustering.

I will also take a deeper look at what makes a winning team, i.e. what type of players should be put together for a team to win an NBA title.

Let's get into it!

Data Scraping and Preparation

I began by scraping data directly from NBA.com: the players' traditional statistics and the players' shooting statistics. I collected stats for all 529 players who played in the league in the 2019-2020 season.

Along with traditional stats (points per game, assists, rebounds, etc.), I also collected stats describing shot location, type of offensive play (drive, iso, etc.), defensive efficiency and usage rate.

Let's start by scraping the players' traditional statistics.

Before scraping the data, we have to import some libraries to help us. The function of each library is as follows:

NumPy: helps us perform numerical operations on the data
Pandas: helps us store and present data in a tabular format
BeautifulSoup: helps us parse the HTML of the website and extract data from it

Since NBA.com is a dynamic website, we need an additional library to navigate it and successfully scrape the data. That library is:

webdriver from selenium: helps us navigate and browse dynamic websites
Note: Since I want to use Chrome as my webdriver, I also need to download the chromedriver executable from https://chromedriver.chromium.org/.

We first load the Chrome webdriver by pointing it at the chromedriver executable. Once the webdriver is running, we can ask it to open the website by giving it the URL. By default, the website only displays stats for 50 players, so we have to select the option that asks the website to display all players. To do this, we go through the HTML of the page and select the appropriate option. This article explains this step well.

Once everything is displayed on the page, we can use the BeautifulSoup library to scrape the players' stats. We first parse the page source as HTML and find the table that stores the players' stats. We then load that table into a DataFrame using pandas.
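The parsing step can be sketched roughly as follows. This is a minimal illustration, not the original scraping code: the HTML string, the player names and the numbers are made up, and the helper name table_to_dataframe is hypothetical. With Selenium, the string would come from the driver's page_source attribute once all players are displayed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the rendered page. With Selenium this string would
# be driver.page_source after selecting the option that shows all players.
SAMPLE_HTML = """
<table>
  <thead><tr><th>PLAYER</th><th>TEAM</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>AAA</td><td>27.0</td></tr>
    <tr><td>Player B</td><td>BBB</td><td>9.8</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(html):
    """Parse the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Column names come from the header cells.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    # Each body row becomes one list of cell texts.
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return pd.DataFrame(rows, columns=headers)


stats = table_to_dataframe(SAMPLE_HTML)
# Scraped cells are strings, so numeric columns need an explicit conversion.
stats["PTS"] = pd.to_numeric(stats["PTS"])
print(stats)
```

Keeping the parsing in a function that only takes an HTML string makes it easy to reuse for the second page and to test without opening a browser.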

After getting the stats table, we can see that it contains a number of unnecessary columns. We only want to keep the data that is actually shown on the website. The following code fragment selects the columns that are presented on the website.

Next, we do the same for the second page, which holds the shooting statistics.

There are a bunch of NaN values which need to be removed from the table, and the column labels are pretty chaotic because of the multilevel indexing in the original table. We need to fix this by dropping all NaN values and the multilevel index, and renaming the columns with readable labels.

Here we rename the columns and remove unnecessary data from the table which is not present on the original webpage.
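As a rough sketch of this cleanup, assuming a table with two header levels (shot zone, then stat): the column labels and values below are invented for illustration and are not the ones returned by NBA.com.

```python
import numpy as np
import pandas as pd

# Hypothetical shooting table with a two-level header (zone, stat),
# similar in shape to what a multi-row HTML header produces.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "PLAYER"),
    ("Restricted Area", "FGA"),
    ("Mid-Range", "FGA"),
    ("Unnamed: 3_level_0", "Unnamed: 3_level_1"),
])
raw = pd.DataFrame(
    [["Player A", 5.1, 2.0, np.nan],
     ["Player B", 1.2, 4.3, np.nan]],
    columns=columns,
)

# Drop columns that contain only NaN (padding cells from the page layout).
cleaned = raw.dropna(axis=1, how="all")


def flatten(label):
    """Turn a (zone, stat) pair into a single readable column name."""
    zone, stat = label
    if zone.startswith("Unnamed"):
        return stat
    return f"{zone}_{stat}".replace(" ", "_")


# Replace the multilevel index with flat, readable labels.
cleaned.columns = [flatten(label) for label in cleaned.columns]
print(cleaned.columns.tolist())
```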

Preparing the data

After getting both tables, we have to combine them into a single table using the merge operation. The two tables should have a one-to-one relationship (meaning that for each row in the first table there is a corresponding row in the second). We merge the two tables on the player's name, team and age. Then, I decided to get rid of players who played less than 12 minutes per game, as I felt that classifying players based on how they play when they barely play was not going to provide accurate results. That leaves us with a total of 396 players. The resulting table is output as follows.
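A minimal sketch of the merge and the minutes filter, using two tiny made-up tables. The column names PLAYER, TEAM, AGE and MIN are assumptions about how the scraped tables are labeled.

```python
import pandas as pd

# Made-up rows standing in for the two scraped tables.
traditional = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "TEAM": ["AAA", "BBB", "CCC"],
    "AGE": [25, 30, 22],
    "MIN": [34.5, 8.2, 20.1],
    "PTS": [27.0, 3.1, 9.8],
})
shooting = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "TEAM": ["AAA", "BBB", "CCC"],
    "AGE": [25, 30, 22],
    "Total_FGA": [19.0, 3.0, 8.5],
})

# validate="one_to_one" makes pandas raise an error if a key is duplicated
# on either side, which checks the one-to-one assumption.
merged = traditional.merge(
    shooting,
    on=["PLAYER", "TEAM", "AGE"],
    how="inner",
    validate="one_to_one",
)

# Keep only players with at least 12 minutes per game.
players = merged[merged["MIN"] >= 12].reset_index(drop=True)
print(players)
```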

Feature creation

I decided to create 3 new variables describing what percentage of a player's field goal attempts come from each area of the court (paint, mid-range or 3-point line).

I first add up the field goal attempts (FGA) from the Restricted Area and the Paint as Paint_FGA, because they are the closest shots to the rim. Then I add up the field goal attempts of all the shots taken behind the 3-point line as 3P_FGA.

Next, we can calculate the percentage of field goal attempts from the different areas of the court as follows:

Close Range Percentage = Paint_FGA / Total_FGA
Mid-Range Percentage = MD_FGA / Total_FGA
Perimeter Percentage = 3P_FGA / Total_FGA
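The three ratios above could be computed along these lines. The per-zone column names (other than Paint_FGA, MD_FGA, 3P_FGA and Total_FGA) and the numbers are made up, and Total_FGA is assumed here to be the sum of the three areas, which may differ from how the original table defines it.

```python
import pandas as pd

# Made-up per-zone field goal attempts for two players.
shots = pd.DataFrame({
    "PLAYER": ["Player A", "Player B"],
    "RA_FGA": [6.0, 1.0],            # Restricted Area (hypothetical name)
    "Paint_NonRA_FGA": [2.0, 1.0],   # Paint outside the Restricted Area
    "MD_FGA": [2.0, 3.0],            # Mid-range
    "Corner3_FGA": [2.0, 1.0],       # Corner threes
    "AboveBreak3_FGA": [8.0, 4.0],   # Above-the-break threes
})

# Group the zones into the three areas used in the formulas.
shots["Paint_FGA"] = shots["RA_FGA"] + shots["Paint_NonRA_FGA"]
shots["3P_FGA"] = shots["Corner3_FGA"] + shots["AboveBreak3_FGA"]

# Assumption: total attempts are the sum of the three areas.
shots["Total_FGA"] = shots["Paint_FGA"] + shots["MD_FGA"] + shots["3P_FGA"]

# Share of attempts coming from each area.
shots["Close_Range_Pct"] = shots["Paint_FGA"] / shots["Total_FGA"]
shots["Mid_Range_Pct"] = shots["MD_FGA"] / shots["Total_FGA"]
shots["Perimeter_Pct"] = shots["3P_FGA"] / shots["Total_FGA"]

print(shots[["PLAYER", "Close_Range_Pct", "Mid_Range_Pct", "Perimeter_Pct"]])
```

Note that these values are fractions between 0 and 1, so a value of 0.40 means 40% of a player's attempts.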

Data exploration

Before we begin clustering the players, let's visualize our data and get a first idea of how the players might be clustered together. We will first import the classic visualization library matplotlib to help us visualize the data. Next, I am interested in the shot selection of the top NBA scorers in the 2019-2020 season. Therefore, I've identified the top 50 NBA scorers of the 2019-2020 season by their points scored per game. I will plot scatter graphs of Minutes Played vs Close Range Shot, Minutes Played vs Mid-Range Shot and Minutes Played vs 3-Point Shot for all players. The top 50 scorers will be marked in red.
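One way to mark the top 50 scorers and draw such a scatter plot is sketched below. The data is randomly generated, so the picture will not match the real league; the column names and the helper plot_shot_share are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the merged player table (not real NBA data).
rng = np.random.default_rng(0)
n_players = 120
players = pd.DataFrame({
    "MIN": rng.uniform(12, 38, n_players),
    "PTS": rng.uniform(2, 35, n_players),
    "Close_Range_Pct": rng.uniform(0, 0.8, n_players),
})

# Flag the 50 players with the most points per game.
top_index = players["PTS"].nlargest(50).index
players["Top50"] = players.index.isin(top_index)


def plot_shot_share(df, share_col):
    """Scatter minutes played against a shot share, top scorers in red."""
    fig, ax = plt.subplots(figsize=(8, 5))
    rest = df[~df["Top50"]]
    top = df[df["Top50"]]
    ax.scatter(rest["MIN"], rest[share_col], color="grey", alpha=0.6,
               label="Rest of the league")
    ax.scatter(top["MIN"], top[share_col], color="red",
               label="Top 50 scorers")
    ax.set_xlabel("Minutes played per game")
    ax.set_ylabel(share_col)
    ax.legend()
    return ax


ax = plot_shot_share(players, "Close_Range_Pct")
```

The same function can be called with the mid-range and 3-point columns to produce the other two graphs; in a script, plt.show() displays the figures.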

We can see that James Harden was the highest scorer in the 2019-2020 NBA season. This makes sense, since Harden won the scoring title this season. It is interesting to see that the top 4 scorers this season were backcourt players. It was usually frontcourt players who dominated this category, but we have seen a shift in recent years, with backcourt players outscoring frontcourt players. Even though James Harden had the highest scoring average this season, the Greek Freak (Giannis Antetokounmpo) won the league MVP for the second year in a row, with one of the highest +/- ratings (total contribution to the team) that I've seen in the past few years. The Freak had a +/- of 10.8, more than double Harden's 4.2.

Going back to our analysis, we will first plot the Minutes Played vs Close Range Shot graph as follows:

Looking at the graph, there is no clear relationship between minutes played and the percentage of close range shots attempted. However, if we look at the close range shots attempted by the top 50 scorers, we can see that most of them take a fair number of close range shots, with shares ranging from about 0.10 to 0.25 of their attempts. We can also see that the top 50 scorers played more minutes than the rest of the league, which makes sense because the longer a player is on the court, the more points he can score per game. However, there is one interesting player who played the fewest minutes among the top scorers and yet is able to excel by taking the majority of his shots in the paint. Let's find out who he is by sorting the top 50 scorers by their close range shot percentage.

That player was Zion Williamson, the first overall pick of the 2019 NBA Draft out of Duke University. Zion has been known for his flashy dunks and acrobatic finishes at the rim ever since high school. It is incredible to see his dominance in this category even with only 24 games played and fewer minutes per game. Given his young age, he could dominate the paint for years to come.

Next, let's look at the Minutes Played vs Mid-Range Shot graph.

Again, there is no clear relationship between minutes played and the percentage of mid-range shots. Compared to close range shots, we can see that the top 50 scorers take fewer mid-range shots. In fact, this is the case for the rest of the league as well. Let's look at which top scorer had the highest mid-range percentage.

Unsurprisingly, the player was LaMarcus Aldridge, who has specialized in post-up mid-range shots throughout his career. It is interesting to see that the two players with the highest mid-range shot percentage were on the same team, the San Antonio Spurs. The Spurs missed the playoffs this season, ending a streak of 22 consecutive appearances. It will be interesting to see how the Spurs do next season with two of the best mid-range shooters.

Finally, let's look at the Minutes Played vs 3-Point Shot graph.

Looks like the players took 3-point shots just as often as close range shots. Again, there is no clear relationship between minutes played and 3-point shot percentage. I guess the number of minutes played did not affect the players' shot selection. However, if we look at the 3-point shots attempted by the top 50 scorers, we can see that most of them take a share of about 0.05 to 0.15 of their shots from 3-point range. Compared to the rest of the league, the share of shots attempted from this area by the top 50 scorers is lower. This is interesting because everyone wants to replicate the success of the Golden State Warriors by shooting more threes every season, yet the top scorers had a more balanced shot selection. Maybe teams will need to rethink their strategy in the future. Let's find the top shooters in the league.

Surprisingly, Buddy Hield led the league in 3-point shot percentage in the 2019-2020 season. This category used to be dominated by Stephen Curry and Klay Thompson of the Golden State Warriors. I guess this is because of the major injuries to both Warriors players, which kept them from playing a full season. However, we can still see Curry on the list even though he only played 5 games all season. To me, this shows why Curry is so often called the best shooter of all time.

Player Clustering

Let’s begin by scaling the data. Scaling changes the range of values without changing the shape of their distribution. That is useful because distance-based machine learning algorithms such as K-Means work much better when features are on the same scale; otherwise, features with large values (such as points per game) dominate features with small values (such as shot percentages).

Here is how to scale the data. First, we select the columns that are relevant for clustering. Then, we use the StandardScaler available in the scikit-learn library to scale them.

Then, it’s time to find the best number of clusters. To do so, I will use the KMeans algorithm and the silhouette score, both of which are available in scikit-learn.
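As a small illustration of the scaling step, here is a sketch on a made-up table. The choice of feature columns is an assumption; the real analysis uses many more statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up players with features on very different scales.
players = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "PTS": [27.0, 9.8, 15.2, 4.0],
    "AST": [7.5, 1.2, 3.3, 0.8],
    "Close_Range_Pct": [0.40, 0.20, 0.55, 0.10],
})

# Only numeric columns that describe play style are scaled.
feature_cols = ["PTS", "AST", "Close_Range_Pct"]

# StandardScaler centers each column at 0 and rescales it to unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(players[feature_cols])

# Wrap the result back into a DataFrame to keep the column names.
scaled_df = pd.DataFrame(scaled, columns=feature_cols, index=players.index)
print(scaled_df.round(2))
```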

The silhouette score is a metric that measures the quality of clusters. It ranges between 1 and -1, with a score of 1 meaning that clusters are well apart from each other and clearly different (which is what we want).

So simply put, we want the highest silhouette score possible.

The following loop calculates the silhouette score for every k between 6 and 12. I started the loop at 6 because at 5, it basically classifies players by traditional position.
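A loop of this kind might look like the sketch below. It runs on synthetic blobs generated with scikit-learn, not on the player table, so the scores it prints say nothing about the real NBA data. The random_state values are fixed only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled player features (not real NBA data).
X, _ = make_blobs(n_samples=400, centers=9, n_features=6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for every k from 6 to 12 and record the silhouette score.
scores = {}
for k in range(6, 13):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# The candidate with the highest silhouette score.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Without a fixed random_state, K-Means starts from different random centroids on each run, so the scores (and sometimes the best k) can change between runs.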

We see that we should be using a total of 9 clusters to classify NBA players based on the way they play. Note that if you run the KMeans algorithm again, the highest score may not be at k=9, because K-Means starts from randomly initialized centroids. Here, we choose k=9 because it has the highest score in this run. In fact, k=9 also has the highest average score across runs. So, we will set k=9.

Let’s look at those clusters.

We will first label each player with the appropriate cluster and group the players by cluster.

Next, we will calculate the average of the statistics for each cluster. We will select the data that is relevant and remove all NaN values. We will then analyze that data and try to come up with names that best describe the clusters. To ease our analysis, I've included a function that highlights the highest value in each category.

Given these statistics, it looks like it would still be difficult for us to name the clusters.

So, let's select a feature player from each cluster based on their +/- (contribution to the team) to help us cross-reference with the statistics above and come up with names for the clusters.
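The labeling and averaging steps could be sketched as follows. To keep the example short it uses 9 made-up players and only 3 clusters instead of 9, and the feature columns are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up players forming three obvious play styles.
players = pd.DataFrame({
    "PLAYER": [f"Player {c}" for c in "ABCDEFGHI"],
    "PTS": [28.0, 26.5, 27.2, 8.0, 7.5, 9.1, 14.0, 13.2, 15.1],
    "REB": [5.0, 4.5, 5.5, 11.0, 12.2, 10.5, 3.0, 2.8, 3.5],
    "Perimeter_Pct": [0.35, 0.40, 0.38, 0.02, 0.05, 0.03, 0.70, 0.65, 0.72],
})
feature_cols = ["PTS", "REB", "Perimeter_Pct"]

# Scale the features, then assign every player to a cluster.
X = StandardScaler().fit_transform(players[feature_cols])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
players["Cluster"] = kmeans.fit_predict(X)

# Average of each statistic per cluster (on the original, unscaled values).
cluster_means = players.groupby("Cluster")[feature_cols].mean()

# For each statistic, the cluster with the highest average.
leaders = cluster_means.idxmax()

print(cluster_means.round(2))
print(leaders)
```

In a notebook, cluster_means.style.highlight_max(axis=0) gives a similar result visually by coloring the highest value in each column.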
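Picking one feature player per cluster can be done with a groupby and idxmax, as in this small sketch. The players, cluster labels and +/- values are made up.

```python
import pandas as pd

# Made-up players with a cluster label and a +/- rating.
players = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Cluster": [0, 0, 1, 1, 1],
    "+/-": [4.2, 10.8, -1.5, 2.3, 0.7],
})

# Row index of the player with the highest +/- inside each cluster.
best_rows = players.groupby("Cluster")["+/-"].idxmax()

# One representative player per cluster.
feature_players = players.loc[best_rows, ["Cluster", "PLAYER", "+/-"]]
feature_players = feature_players.reset_index(drop=True)
print(feature_players)
```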

Looking at this, I came up with the following names for the clusters.

This classification is very similar to the one described in the article. Our analysis is therefore broadly consistent with the analysis done by experts in the field.

The overall classification of players is as follows:

So what makes a good team?

With all that in mind, it can be interesting to see how good teams were composed this year.

Here, I considered the good teams to be the final 8 teams in the playoffs this year. The following code separates the teams and creates a radar plot, using Plotly.

First, we separate the players on good teams from the rest of our table. Then we analyze the types of players that the good teams have compared to the rest of the league, using the radar plot available in the Plotly library.
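The numbers behind such a radar plot can be prepared roughly like this. The teams, the rosters and the set of good teams are invented for the example, and the column names are assumptions; only the cluster names come from this analysis.

```python
import pandas as pd

# Made-up rosters: one row per player with team and cluster name.
players = pd.DataFrame({
    "TEAM": ["AAA", "AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"],
    "Cluster_Name": [
        "Floor General", "Low Usage Bigs", "Low Usage Bigs",
        "High Usage Bigs", "Floor General",
        "High Usage Bigs", "High Usage Bigs", "3 & D Players",
    ],
})

# Hypothetical set of good teams (in the article: the final 8 playoff teams).
good_teams = {"AAA"}
players["Group"] = players["TEAM"].isin(good_teams).map(
    {True: "Good teams", False: "Other teams"}
)

# Number of players of each type in each group of teams.
counts = pd.crosstab(players["Group"], players["Cluster_Name"])

# Divide by the number of teams in each group to get players per team.
teams_per_group = players.groupby("Group")["TEAM"].nunique()
per_team = counts.div(teams_per_group, axis=0)
print(per_team)
```

Each row of per_team can then be drawn as one trace of the radar plot, for example with Plotly's go.Scatterpolar, using the row values as r and the cluster names as theta.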

This radar plot shows a few interesting things. First, we see that bad teams had a lot more players in the "High Usage Bigs" cluster. This type of player does not seem to help in building a contender, especially in the 'Small Ball' era. These bigs tend to move slower, and it is hard for them to guard smaller, ball-dominant scorers like James Harden. Bad teams also had more "Athletic Forwards", "3 & D Players" and "High Usage Guards" compared to the good teams. This is again the 'Warriors effect' that causes every team to go 'small' and try to replicate the success of the Warriors dynasty. However, based on this data, that alone does not appear to be the way to build a contending team.

Good teams also tend to have more "High Quality Contributors" (2 vs 1.5 per team). That is not surprising. Having players that belong to the "Floor General" cluster is one part of the equation (and it’s obviously very important). Surrounding them with good shooters who stretch defenses and quality players who can share the scoring and playmaking load of the stars is crucial too. Another interesting thing in this plot is that the good teams had more "Low Usage Bigs" than other teams. I guess it is much more beneficial to have several "Low Usage Bigs" in the rotation to protect the rim than one "High Usage Big" who has to carry the entire rim protection load.

Conclusion

There is more than one way to win in the NBA. However, surrounding your star players with the right pieces is a crucial part of it. The traditional positions are not sufficient to classify today's NBA players. Teams need to find a better way to classify players and select the ones that best fit their team. Hopefully, this tutorial gave you some insight into that.

Thanks a lot for reading!